Objective: We illustrate classification methods using the Titanic case study. We work up to advanced methods such as Random Forests, building the concepts from the ground up; no prior knowledge of Machine Learning is required.
To get the most out of this session, students should ideally have a foundational understanding of:
- Basic Probability: Understanding of random variables and distributions.
- Introductory Statistics: Familiarity with the concept of regression (linear models) and hypothesis testing.
- R Programming: Basic knowledge of the R syntax to follow the code implementation.
Classification describes the process of predicting a discrete label (category) based on input data.
Real-World Applications & Impact
Healthcare: Medical diagnosis (e.g., “Malignant vs. Benign”) - Saving lives through early detection.
Finance: Credit Scoring & Fraud Detection - Assessing risk and securing global transactions.
Technology: Spam filtering and sentiment analysis - Curating the digital experience.
The Classic Titanic Dataset
Machine Learning Workflow
Decision Tree: A non-linear, tree-structured approach that recursively splits data based on feature values to form decision rules, offering high interpretability.
Random Forest: An ensemble method that builds multiple decision trees (a “forest”) and takes a majority vote for the final prediction.
More Learning
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg.
Key Variables:
- Survived: 0 = No, 1 = Yes
- Pclass: Ticket class (1st, 2nd, 3rd)
- Sex: Male/Female
- Age: Age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Fare: Passenger fare
Question: Unfortunately, there weren’t enough lifeboats for everyone on board. What sorts of people were more likely to survive?
The accuracy of an estimate \(\hat f(x)\) depends on reducible and irreducible error:
\[\text{E}\big(y - \hat f(x)\big)^2 = \big[f(x) - \hat f(x)\big]^2 + \text{Var}(\epsilon)\]
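The decomposition follows by writing \(y = f(x) + \epsilon\) with \(\text{E}(\epsilon) = 0\) and treating \(\hat f\) as fixed:

\[\begin{aligned}
\text{E}\big(y - \hat f(x)\big)^2
  &= \text{E}\big(f(x) + \epsilon - \hat f(x)\big)^2 \\
  &= \big[f(x) - \hat f(x)\big]^2 + 2\,\big[f(x) - \hat f(x)\big]\,\text{E}(\epsilon) + \text{E}(\epsilon^2) \\
  &= \underbrace{\big[f(x) - \hat f(x)\big]^2}_{\text{reducible}} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible}},
\end{aligned}\]

since \(\text{E}(\epsilon) = 0\) kills the cross term and \(\text{E}(\epsilon^2) = \text{Var}(\epsilon)\).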
Machine Learning (ML) relies on splitting the data into training and test sets to find the optimal model.
We split our dataset in two: one part for training the model, the other held out for testing.
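A minimal sketch of this split using the tidymodels `rsample` package (assuming the Titanic data has been loaded into a data frame called `titanic`; the 80/20 proportion is an illustrative choice):

```r
library(rsample)  # tidymodels data-splitting helpers

set.seed(42)  # reproducible split
# Stratify on the outcome so both sets keep similar survival rates
split <- initial_split(titanic, prop = 0.8, strata = Survived)
train <- training(split)  # 80% used to fit the model
test  <- testing(split)   # 20% held out for evaluation
```

Stratifying on `Survived` matters here because the classes are imbalanced; a purely random split could leave the test set with too few survivors.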
```r
library(recipes)  # tidymodels pre-processing

rf_rec <-
  # Define classification variables
  recipe(Survived ~ ., data = train) |>
  # Missing value imputation
  step_impute_median(all_numeric()) |>
  # Factor handling: split into binary dummy terms
  step_dummy(all_nominal_predictors()) |>
  # Execute the transformations
  prep()

# Extract imputed training data
train_dt <- juice(rf_rec)

# Apply pre-processing to test data
test_dt <- bake(rf_rec, new_data = test) |>
  as.data.frame()
```

Bootstrap aggregating (bagging) is a general-purpose procedure designed to improve the stability and accuracy of statistical learning methods.
Bagging trees: construct \(B\) trees using bootstrapped datasets. Each tree is grown deep and is not pruned.
Random forests additionally select, at each candidate split, a random subset of \(m\) features out of the full set of \(p\) (typically \(m \approx \sqrt{p}\) for classification), which decorrelates the trees.
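As a sketch, a random forest can be fit on the pre-processed data via the tidymodels `parsnip` interface (this assumes the `train_dt`/`test_dt` objects produced by the recipe above and the `ranger` engine installed; `mtry = 3` and `trees = 500` are illustrative choices, with `mtry` near \(\sqrt{p}\)):

```r
library(parsnip)

rf_fit <-
  rand_forest(mode = "classification",
              trees = 500,  # number of bootstrapped trees B
              mtry  = 3) |> # features tried at each split (~ sqrt(p))
  set_engine("ranger") |>
  fit(Survived ~ ., data = train_dt)

# Class predictions on the held-out test set
rf_pred <- predict(rf_fit, new_data = test_dt)
```

Each tree votes on a class; the forest returns the majority vote, which is what `predict()` reports for classification mode.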
| Prediction | Truth: Survived | Truth: Died |
|------------|----------------:|------------:|
| Survived   | 42              | 7           |
| Died       | 27              | 103         |
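The usual summary metrics can be read off this confusion matrix directly; a base-R sketch using the counts above:

```r
# Confusion matrix from the table above (columns = truth)
cm <- matrix(c(42, 27, 7, 103), nrow = 2,
             dimnames = list(Prediction = c("Survived", "Died"),
                             Truth      = c("Survived", "Died")))

accuracy    <- sum(diag(cm)) / sum(cm)                             # (42 + 103) / 179
sensitivity <- cm["Survived", "Survived"] / sum(cm[, "Survived"])  # 42 / 69
specificity <- cm["Died", "Died"] / sum(cm[, "Died"])              # 103 / 110
```

Here the model is much better at identifying deaths (specificity ~0.94) than survivors (sensitivity ~0.61), a consequence of the class imbalance.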
AUC / ROC
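The ROC curve plots the true-positive rate against the false-positive rate across all classification thresholds; the area under it (AUC) can be computed directly from predicted scores via the rank (Mann–Whitney) formulation. A base-R sketch (the function name `auc` is ours):

```r
# AUC via the rank formulation: probability that a random positive
# case receives a higher score than a random negative case
auc <- function(scores, labels) {  # labels: 1 = positive, 0 = negative
  r <- rank(scores)                # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))  # perfect ranking -> 1
```

An AUC of 0.5 corresponds to random guessing; 1.0 to perfect separation of the classes.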
Hint: Think about “Interpretability” vs. “Black Box”.
Logistic Regression: A linear model for binary classification that estimates the probability of a data point belonging to a particular class using the sigmoid function.
K-Nearest Neighbors: A simple, distance-based "lazy learner" that classifies data based on the majority class of its \(k\) closest neighbors.
Naive Bayes: A probabilistic classifier based on Bayes’ Theorem.
Gradient Boosting: Sequential ensemble techniques that build trees to correct previous errors.
Support Vector Machines: Finds the optimal hyperplane to maximize the margin between different classes.
Artificial Neural Networks: Deep learning models that learn through interconnected layers of neurons, suitable for complex pattern recognition.
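Of the methods above, logistic regression is available in base R and makes a compact worked example. A sketch on small synthetic data (not the Titanic set), where the true class probability follows a sigmoid of a linear predictor:

```r
set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))  # sigmoid of the linear predictor
y <- rbinom(200, size = 1, prob = p)

# glm() with a binomial family estimates the log-odds coefficients
fit <- glm(y ~ x, family = binomial)

# Predicted probability of class 1 at x = 0
prob <- predict(fit, newdata = data.frame(x = 0), type = "response")
```

The fitted slope should recover something close to the true value 2, and `prob` is a genuine probability in (0, 1), which is what distinguishes logistic regression from an unconstrained linear model.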
James, Witten, Hastie, and Tibshirani. An Introduction to Statistical Learning (R/Python editions).
Kuhn and Silge. Tidy Modeling with R: A Framework for Modeling in the Tidyverse.